ABSTRACT

Instructors who develop classroom examinations that
require students to provide a numerical response to a mathematical
problem are often very concerned about the appropriateness of the
multiple-choice format. The present study augments previous research
relevant to this concern by comparing the difficulty and reliability
of multiple-choice and completion item formats as applied to the
classroom measurement of quantitative skills. This investigation also
includes two variations of the multiple-choice format designed to
reduce cues provided by alternatives. Focus is placed on the external
validity of the experiment by using an actual examination of course
material administered to students in a realistic classroom setting.
When plausible distractors are used, minimal effects on difficulty
and reliability are observed as a result of using "none of the above"
or by using ranges of values for alternatives. The results of the
study also support serious consideration of the math-completion
format when efficiency of scoring is not a major concern. It is shown
that fewer math-completion items are required for obtaining
reliability equal to that provided by multiple-choice items.
Implications which varying difficulties and reliabilities have on
grading standards and test length are discussed. (Author/AL)

Authors of educational measurement texts generally favor use of test
items which require making a choice among specified alternatives in contrast
to items which require the examinee to produce a limited free response.
Wesman (1971) recommends against the use of short-answer items, concluding
that their superiority over selection-type items is more apparent than real in
actual testing situations. Ebel (1979) indicates that short-answer items
are used mainly to test for factual information, and that good objective
test items do not permit identification of the correct response on the basis
of simple recognition or sheer rote memory. Popham (1981) takes a more
cautious approach by suggesting a major weakness of multiple-choice items is
the ability of examinees to recognize correct answers that, without assistance,
they would not be able to construct.

Instructors who develop classroom examinations that require students to
provide a numerical response to a mathematical problem are often very concerned
about the appropriateness of the multiple-choice format. The present study
augments previous research relevant to this concern by comparing the difficulty
and reliability of multiple-choice and completion item formats as applied to
the classroom measurement of quantitative skills. This investigation also
includes two variations of the multiple-choice format designed to reduce cues
provided by alternatives. Focus is placed on the external validity of the
experiment by using an actual examination of course material administered to
students in a realistic classroom setting. Implications which varying
difficulties and reliabilities have on grading standards and test length are
discussed.

Background 

The literature contains a limited number of investigations comparing
math-completion and various multiple-choice formats. Wesman and Bennett (1946)
used a multiple-choice test battery administered to nursing school applicants.
A portion of subjects were administered a modified form of the test in which
the fifth alternative was changed to "none of these." The difficulty and
item-test correlations of test items that measured arithmetic skills were on the
average quite similar for the versions.

Frederickson and Salter (1953) discussed the development of the Navy
Arithmetical Computation Test and demonstrated the appropriateness of
constructing multiple-choice alternatives from answers generated from completion
items. Shifts in item difficulty from the free-answer to the multiple-choice
forms were found to be relatively small. Rimland and Zwerski (1962) reported
similar findings in the development of the Navy Arithmetic Test.



Traub and Fisher (1977) compared the equivalence of constructed-response
and multiple-choice formats on mathematical reasoning and verbal comprehension
subtests. Eighth-grade students were initially administered items in the
constructed-response format. To control for the retention effect inherent
in a study by Heim and Watts (1967) using verbal items, Traub and Fisher
administered items rewritten in the multiple-choice format two weeks later.
Mean test scores were 3% to 6% lower when items were written in the
multiple-choice format. Alpha reliability coefficients for alternate forms of the
30-item math test were, with one exception, between .84 and .87. Using a
procedure suggested by Lord (1971) for assessing equivalence, the tests of
mathematical reasoning were found to measure the same psychological dimensions
independent of item format. Approximately nine hours were required in the
Traub and Fisher study to administer the battery of instruments. Student
motivation was recognized as a problem within the experimental conditions.

The present investigation evaluated math-completion and selected
multiple-choice item formats for equivalence in difficulty and reliability when
administered under conditions representative of classroom examinations. Alternate
item formats were administered concurrently to groups of examinees equated
through random assignment. Multiple-choice options were formulated by the
instructor using experiential knowledge of common errors instead of from
responses empirically derived from previous free-response forms of the item.
"None of the above" and ranges of numerical responses were investigated as
possible techniques for reducing the effect that providing the student with
response options may have on identifying the correct answer.

Method 

An examination in a business finance course was used in the investigation.
The examination was developed by the instructor using test development and 
item construction principles discussed in most introductory measurement texts. 
The test length varied from 34 to 40 items across the academic terms in which 
the study was conducted. 

Skills assessed by 12 test items were identified for use in the study. 
Each of the 12 items was written in the following four formats (abbreviated
identifiers are given in parentheses):

1. Completion (Completion).

2. Multiple-choice using a single numerical value for each of five
alternatives; each of the distractors represented common errors
(5-Values).

3. Multiple-choice as above, except the fifth alternative was
replaced with "none of the above" (N of Above).

4. Multiple-choice using ranges of values incorporating all possible
values of the examinee's answer; the ranges of the alternatives
respectively encompassed the five numerical values used above
(Ranges).



A common stem was used across the four forms of each test item. The Figure
illustrates how an item was adapted to each of the formats.



Insert Figure about here 



Four forms of the examination were prepared. Table 1 describes how the
12 items included in the investigation appeared in the same order within each
form, but in different formats across the four forms. Each triad of items
used an A, C, or E as the correct multiple-choice alternative, but not
necessarily in that order. The 12 items were administered to undergraduate
business majors as part of a course examination in each of three academic
terms. The four forms were randomly ordered before being distributed to
students each term. The total number of students assigned to each of the
forms is indicated in Table 1.

All forms of the test shared a common scoring key with the exception of
items written in the completion format. Responses were recorded by examinees
on machine-readable answer forms except that answers to the completion items
were initially recorded in the test booklets. The instructor scored responses
to the completion items and marked the keyed response (A, C, or E) on the
student's answer form if the response was found to be correct. The answer
forms were then machine scored with all items scored dichotomously.

Item p-values were calculated separately for the 12 items written in
each of the four formats. The weighted mean difficulty was then established
for each item format. Items incorporating "none of the above" as a response
alternative were further analyzed by comparing the difference in item
difficulty that occurred as a function of whether this alternative represented the
correct response.
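
The item-difficulty calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, the counts are invented for the example, and weighting each p-value by the number of examinees who answered the item is an assumption about what "weighted mean" refers to.

```python
def p_value(n_correct, n_examinees):
    """Item difficulty: proportion of examinees answering the item correctly."""
    return n_correct / n_examinees


def weighted_mean_difficulty(items):
    """Mean p-value weighted by the number of examinees per item.

    `items` is a list of (n_correct, n_examinees) pairs. Weighting by
    examinee count is an assumption made for this sketch.
    """
    total_correct = sum(correct for correct, _ in items)
    total_examinees = sum(n for _, n in items)
    return total_correct / total_examinees


# Hypothetical counts for three items answered by groups of different sizes.
items = [(30, 60), (45, 60), (12, 48)]
print([round(p_value(c, n), 3) for c, n in items])
print(round(weighted_mean_difficulty(items), 3))
```

Because the four forms were taken by groups of slightly different sizes, a weighted mean of this kind can differ a little from the simple average of the twelve p-values.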

To facilitate discussion of the findings, four expanded tests were
conjectured, each consisting of 40 items equivalent to the completion or one
of the three multiple-choice types of items included in the present
investigation. Setting item difficulty, variance, and covariance consistent with
those observed in the study, means and standard deviations of scores on the
conjectured tests were estimated. Assuming a fixed shape to the distribution
of scores, percentile ranks associated with specific criterion scores were
also estimated for each of the four expanded tests.
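
One way to carry out such a projection, offered here as an assumption rather than as the authors' documented computation, uses the classical result that a total score over k items has a mean of k times the average item difficulty and a variance equal to the sum of the item variances plus the sum of all inter-item covariances. The function name and the input values below are hypothetical.

```python
import math


def project_test(k, mean_p, mean_item_var, mean_item_cov):
    """Project the mean and SD of total scores on a k-item test.

    Assumes all k items share the given average difficulty, variance,
    and pairwise covariance (a simplification made for this sketch).
    """
    mean = k * mean_p
    # k item variances plus k*(k-1) ordered pairwise covariances
    variance = k * mean_item_var + k * (k - 1) * mean_item_cov
    return mean, math.sqrt(variance)


# Hypothetical inputs: 40 items, p = .50, item variance .25, covariance .05
mean, sd = project_test(40, 0.50, 0.25, 0.05)
print(round(mean, 2), round(sd, 2))
```

The covariance term grows with k(k - 1), so even modest inter-item covariances dominate the projected standard deviation of a long test.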

The KR-20 reliability coefficient was calculated for each triad of items
within each of the four item formats, and a pooled estimate obtained for each 
format. The Spearman-Brown formula was used to calculate reliabilities for 
40-item tests consisting of equivalent items. The formula was also used to 
determine the ratio of items required for reliability equal to that of the 
completion item format. 
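
As a rough illustration of the KR-20 coefficient mentioned above, the sketch below applies the standard formula, k/(k - 1) times one minus the ratio of summed item variances (p times q) to the total-score variance, to a small invented response matrix. The data and the function name are hypothetical, and the use of the population variance (dividing by N) is an assumption.

```python
def kr20(responses):
    """KR-20 reliability for rows of dichotomous (0/1) item scores."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Population variance of total scores (assumed convention)
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses of five examinees to a triad of items
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
print(round(kr20(responses), 3))
```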



Insert Table 1 about here



Results

Observed p-values for the 12 items within each of the four formats are
listed in Table 2. The items incorporated in the investigation are mostly
of moderate difficulty with the middle 50% of the values ranging between .475
and .695. Even with a somewhat restricted range of difficulties, correlations
between rankings of p-values ranged from .72 to .91. Completion and 5-Values
had the highest correlations with alternate formats, whereas N of Above had
the lowest.



Insert Table 2 about here
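
The correlations between rankings of p-values can be illustrated with a Spearman rank correlation. The paper does not name the coefficient used, so Spearman's rho is an assumption here, and the helper functions are hypothetical. The p-values are the Completion and 5-Values columns of Table 2; the simple ranking below assumes no tied values, which holds for these two columns.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# p-values for items 1-12 as listed in Table 2
completion = [.367, .483, .250, .627, .644, .695,
              .404, .439, .702, .542, .562, .188]
five_values = [.559, .661, .441, .825, .860, .789,
               .500, .541, .875, .550, .600, .300]

print(round(spearman_rho(completion, five_values), 2))  # about 0.91
```

For this pair of formats the coefficient falls at the upper end of the range reported in the text.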

Completion items were consistently the most difficult, with the three
multiple-choice formats being of near-equal difficulty. Providing ranges of
values for alternatives in contrast to specific numerical values did not
affect item difficulty overall. Table 3 illustrates how substituting "none
of the above" as an option generally made the item more difficult, with almost
all the increased difficulty occurring when "none of the above" was the correct
answer.



Insert Table 3 about here

Table 4 presents the means and standard deviations that were projected for
a 40-item test. Assuming normal distributions of scores for each of the tests
(a condition that in reality may not be true), percentile equivalents across
the four formats can be established as illustrated in Table 5. Scores which
were equivalent to selected percentile ranks for Completion items were computed
first, and the percentile ranks of these scores for each of the multiple-choice
formats subsequently determined. For example, a projected score of 16.918
would represent the 40th percentile for Completion items, but only 20%, 23%,
and 19% of the examinees would be expected to score below this score when
administered corresponding tests using the respective multiple-choice item
formats.



Insert Table 4 about here



Insert Table 5 about here
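
The percentile-equivalence procedure can be sketched with Python's standard-library normal distribution. The means and standard deviations are the Completion and 5-Values entries of Table 4; the variable names are hypothetical, and small differences from the tabled values are to be expected because the authors' rounding and z-values are not stated.

```python
from statistics import NormalDist

# Projected 40-item score distributions (means and SDs from Table 4)
completion = NormalDist(mu=19.67, sigma=11.34)
five_values = NormalDist(mu=25.05, sigma=9.15)

# Step 1: score at the 80th percentile of the Completion distribution
score = completion.inv_cdf(0.80)

# Step 2: percentile rank of that same score in the 5-Values distribution
rank_in_five_values = five_values.cdf(score) * 100

print(round(score, 2), round(rank_in_five_values, 1))
```

Under these assumptions the score lands close to the 29.223 shown in Table 5, and its rank on the 5-Values test comes out near the tabled 67.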

The pooled estimate of reliabilities associated with the four item formats
is presented in Table 6. Estimates of reliability based on triads of items and
then pooled across the four forms of the test suggest a discrepancy between
Completion and the multiple-choice formats. Among the three multiple-choice
formats, 5-Values resulted in the highest reliability and Ranges in the lowest.
When adjusted to 40-item tests with the Spearman-Brown formula, all formats
resulted in high reliabilities. However, Table 6 also indicates that a
significant proportion of additional multiple-choice items would be required to
obtain reliability equivalent to the Completion format. For example, it is
estimated that 62, 70, and 73 items of the respective multiple-choice formats
would be required to match the reliability of 40 Completion items.



Insert Table 6 about here
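
The Spearman-Brown adjustments can be checked with a short sketch. The pooled reliabilities are those in Table 6. Treating them as reliabilities of 3-item triads, so that a 40-item test is 40/3 times as long, is an assumption; it reproduces the tabled 40-item values to three decimals. The function names are hypothetical.

```python
def spearman_brown(r, factor):
    """Reliability of a test lengthened by `factor` (Spearman-Brown)."""
    return factor * r / (1 + (factor - 1) * r)


def length_ratio(r_target, r_other):
    """Factor by which a test with reliability r_other must be lengthened
    to reach r_target (Spearman-Brown solved for the factor)."""
    return r_target * (1 - r_other) / (r_other * (1 - r_target))


# Pooled triad reliabilities from Table 6
pooled = {"Completion": .572, "5-Values": .465,
          "N of Above": .432, "Ranges": .423}

for name, r in pooled.items():
    r40 = spearman_brown(r, 40 / 3)            # adjusted to 40 items
    ratio = length_ratio(pooled["Completion"], r)
    print(f"{name:11s} {r40:.3f} {ratio:.2f} {40 * ratio:.1f}")
```

The last column is the number of items of each format needed to match 40 Completion items; rounded to whole items it agrees with the 62, 70, and 73 quoted in the text.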


Discussion

Differences in item difficulty are most significant between Completion
and each of the multiple-choice formats. Mean difficulties for the respective
formats suggest that providing examinees with alternative answers results in
test scores approximately 20% to 30% higher than when a math-completion format
is used. (This is inconsistent with the findings of other research studies
referenced previously.) It is probable that examinees rework a problem
presented in the 5-Values format if the worked solution is inconsistent with all
five alternatives. If a solution consistent with an alternative cannot be
obtained, the examinee will likely choose the alternative perceived most
consistent with the obtained solution to the problem. Only if the foils are
able to encompass a high proportion of incorrect solutions or the correct
solution is perceptually deviant from probable incorrect solutions in a manner
not discernible to test-wise behavior would a 5-Values format not provide the
examinee with cues to the correct answer.

The substitution of "none of the above" for the fifth alternative appears 
to have an insignificant effect on item difficulty unless it is the correct 
response. Possibly examinees are leery of using this alternative unless they 
are confident of their solution. Indeed, on average, the difficulty of the
N of Above format is very similar to that observed with the Completion format 
when "none of the above" is the correct response. To suggest that "none of 
the above" be used perpetually as the correct alternative is tempting. 

The Ranges and 5-Values formats resulted in equivalent overall item
difficulties. Ranges does not provide the same degree of feedback to
incorrect solutions as does 5-Values, but may permit selection of the keyed
response by obtaining a nearly correct solution for the wrong reason. It
will also probably promote caution when an examinee's solution deviates
dramatically from the ranges of values used for alternatives. Increasing
ranges of values associated with each alternative would reduce the latter
problem with a consequential increase in the former.

Estimates obtained from the present study suggest that a distribution of
test scores will vary noticeably as a function of the item format used.
Table 5 indicates the greatest differences would be expected between
Completion items and the various multiple-choice formats. Distributions of
scores may not be normal as was assumed for calculating percentiles; however,
differences in means and variability of test scores resulting from varying
item formats are probably sufficiently significant to merit reestablishing
standards if meaningful changes are made in the proportions of math-completion
and multiple-choice items included in tests.

The reliability of all four item formats is respectable. However, the
higher reliability of the Completion items implies that approximately 50%
to 80% additional multiple-choice items are required to obtain reliability
equivalence to the math-completion format. The instructor may wish to
determine the point at which creation of effective response foils, generation of
additional items, and subsequent need for more time in the classroom to
administer longer tests are compensated by the greater scoring efficiency of
multiple-choice items.

The authors find minimal advantage, when using a multiple-choice format,
in camouflaging the correct response by using either a "none of the above"
response or ranges of numerical values for each alternative. The
results of the study also support serious consideration of the math-completion
format when efficiency of scoring is not a major concern. Generalization from
this research context to other measurement settings must be done cautiously.
Subjects included in the present study were fairly competitive college
students who were being assessed on relatively complex mathematical problems.
If, for example, the investigation were replicated with less motivated students,
selection of a multiple-choice alternative might be more a function of guessing,
as was the case in the Traub-Fisher study. More frequent guessing might
further reduce the already lower reliability of multiple-choice items.

Item Stem       If Internal Rate of Return equals 1] percent,
                Profitability Index equals 1, and the Present
                Value of the after-tax cash flows over the life
                of the project equals $268.13, what is the
                initial cash outlay?

Response
Variations

Completion:     ANSWER

5-Values:       A.  $268.13
                B.  $294.00
                C.  $313.07
                D.  $326.00
                E.  $358.00

N of Above:     A.  $268.13
                B.  $294.00
                C.  $313.07
                D.  $326.00
                E.  None of the above

Ranges:         A.  Less than $275
                B.  Between $275 and $300
                C.  Between $300 and $325
                D.  Between $325 and $350
                E.  Greater than $350

Figure. Illustration of an item adapted to the four formats



TABLE 1

Format of Items and Number of Subjects
Assigned to Each Form

                                      Form of Test

Item          Key        A            B            C            D

1, 2, 3       C, A, E    Completion   5-Values     N of Above   Ranges

4, 5, 6       A, E, C    Ranges       Completion   5-Values     N of Above

7, 8, 9       E, A, C    N of Above   Ranges       Completion   5-Values

10, 11, 12    C, A       5-Values     N of Above   Ranges       Completion

Number of
examinees
administered
each form                60           59           57           56

TABLE 2

Item Difficulties Listed
by Item Format

Item       Completion   5-Values   N of Above   Ranges

 1            .367        .559        .509       .583
 2            .483        .661        .737       .542
 3            .250        .441        .368       .500
 4            .627        .825        .792       .817
 5            .644        .860        .645       .717
 6            .695        .789        .667       .750
 7            .404        .500        .317       .475
 8            .439        .541        .633       .695
 9            .702        .875        .283       .831
10            .542        .550        .678       .543
11            .562        .600        .729       .667
12            .188        .300        .119       .368

Average       .492        .623        .589       .626



TABLE 3

Differences in p-Values Between N of Above
and Other Item Formats

                         Difference    Difference    Difference
                         from          from          from
                         Completion    5-Values      Ranges

Average differences
for all 12 items            .098         -.035         -.034

Average differences
for 4 items keyed E         .050         -.086         -.075

Average differences
for 8 items not
keyed E                     .122         -.010         -.014

Negative value indicates that item presented in N of Above
format was more difficult than when presented in alternate
format.



TABLE 4

Projected Means and Standard Deviations
of 40-Item Tests

              Completion   5-Values   N of Above   Ranges

Mean             19.67       25.05       23.55      24.90

Standard
Deviation        11.34        9.15        8.89       6.68

TABLE 5

Projected Percentile Rank Equivalents of
Selected Scores on 40-Item Tests

                        Percentile Rank of Score
Score on
40-Item Test   Completion   5-Values   N of Above   Ranges

   29.223          80          67          74         68
   25.617          70          53          59         52
   22.543          60          40          45         39
   19.674          50          29          33         28
   16.918          40          20          23         19
   13.731          30          12          13         11
   10.125          20           6           7          5

TABLE 6

Reliability Associated with
Various Item Formats

                          Completion   5-Values   N of Above   Ranges

Reliability estimates
pooled across forms          .572        .465        .432       .423

Reliability adjusted
to a 40-item test            .947        .921        .910       .907

Proportion of items
required for
reliability equivalent
to Completion format         1.00        1.54        1.76       1.82

